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Abstract 



Our problem is to accurately solve linear systems of modest dimensions (typically, the num- 
ber of variables equals 32) on a general purpose graphics processing unit. The linear systems 
originate from the application of Newton's method on polynomial systems of (moderately) 
large degrees. Newton's method is applied as a corrector in a path following method, so the 
linear systems are solved in sequence and not simultaneously. One solution path may require 
the solution of thousands of linear systems. In previous work we reported good speedups with 
our implementation to evaluate and differentiate polynomial systems on the NVIDIA Tesla 
C2050. Although the cost of evaluation and differentiation often dominates the cost of linear 
system solving, because of the limited bandwidth of the communication between CPU and 
GPU, we cannot afford to send the linear system to the CPU for solving. 

Because of large degrees, the Jacobian matrix may contain extreme values, requiring ex- 
tended precision, leading to a significant overhead. This overhead of multiprecision arithmetic 
is an additional motivation to develop a massively parallel algorithm. To allow overdeter- 
mined linear systems we solve linear systems in the least squares sense, computing the QR 
decomposition of the matrix by the modified Gram-Schmidt algorithm. We describe our im- 
plementation of the modified Gram-Schmidt orthogonalization method for the NVIDIA Tesla 
C2050, using double double and quad double arithmetic. Our experimental results show that 
the achieved speedups are sufficiently high to compensate for the overhead of one extra level 
of precision. 

Key words and phrases, double double arithmetic, general purpose graphics processing 
unit, massively parallel algorithm, modified Gram-Schmidt method, orthogonalization, quad 
double arithmetic, quality up. 
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1 Introduction 



Homotopy continuation methods provide reliable numerical algorithms to compute all isolated 
complex solutions of a polynomial system. A homotopy is a family of polynomial systems, con- 
necting the system that has to be solved to a system with known solutions. Numerical con- 
tinuation or predictor-corrector methods are applied to track the solution paths starting at the 
known solutions to the solutions of the system that has to be solve. Introductions to homotopy 
continuation methods are in [2j and |l2j . For larger polynomial systems, it often happens that 
the double precision as provided by standard hardware is insufficient to achieve accurate results 
for some solution paths. In the problem setup for this paper we consider the tracking of one 
difficult solution path in extended precision. 

The extended precision arithmetic we perform with the quad double library QD 2 . 3 . 9 |7], and 
in particular on a GPU using the software at For the numerical properties, we refer to [5] 
and |19j . Our development of massively parallel algorithms is motivated by the desire to offset 
the extra cost of double double and quad double arithmetic. We strive for a quality up [l] factor: 
if we can afford to keep the execution time constant, by how much can we improve the quality of 
the solution? 

Using double double or quad double arithmetic we obtain uniform running times. In |24j 
we experimentally determined that the overhead factors of double double over standard double 
arithmetic is indeed similar to the overhead of complex over standard double arithmetic. Relevant 
to quality, the errors are expected to decrease proportionally to the increase in the precision. 
In |23j we described a multicore implementation of a path tracker and the methods used to 
evaluate and differentiate systems of polynomials were implemented on the NVIDIA Tesla C2050 
and described in |25] . The focus of this paper is on the solving of the linear systems, needed to 
run Newton's method. 

Because of the limited bandwidth of CPU/GPU communication we cannot afford to transfer 
the evaluated system and its Jacobian matrix from the GPU to the CPU and perform the linear 
system solving on the CPU. Although the evaluation and differentiation of a polynomial system 
often dominates the cost of Newton's method [23j, the cost of linear system solving increases rel- 
ative to the parallel run times of evaluation and differentiation so that even with minor speedups, 
using a parallel version of the linear system solver matters in the overall execution time. 

In the next section we state our problem, mention related work and list our contributions. In 
the third section we summarize the mathematical definition and properties of the modified Gram- 
Schmidt method and we illustrate the higher cost of complex multiprecision arithmetic. Then 
we describe our parallel version of the modified Gram-Schmidt algorithm and give computational 
results. 

2 Problem Statement and Related Work 

Our problem is to solve a linear system (which may have more equations than unknowns) on a 
GPU. The linear system occurs in the context of Newton's method applied to a polynomial system 
of modest size, e.g.: about 32 polynomials in 32 variables. The double precision as available in 
standard hardware is often insufficient to guarantee accurate results. Our goal is to offset the 
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extra cost of extended precision using a GPU. 

Because the system could have more equations than unknowns and because of increased 
numerical stability, we decided to solve the linear system with a least squares method via a QR 
decomposition of the matrix. The algorithm we decided to implement is the modified Gram- 
Schmidt algorithm, see for its definition and a discussion of its numerical stability. 

Our main challenge for the parallel implementation of the modified Gram-Schmidt orthogo- 
nalization method is the modest size of the dimensions. In our experimental study we typically 
work with 32 variables, although the computational cost increases with a double digit factor if 
we use complex double double and complex quad double arithmetic. 

2.1 Related Work 

Comparing QR with Householder transformations and with the modified Gram-Schmidt algo- 
rithm, the authors of [18j show that on message passing systems, a parallel modified Gram- 
Schmidt algorithm can be much more efficient than a parallel Householder algorithm, and in any 
case not slower. MPI implementations of three versions of Gram-Schmidt orthonormalizations 
are described in [13j. The performance of different parallel modified Gram-Schmidt algorithms on 
clusters is described in [20j. Because the modified Gram-Schmidt method cannot be expressed by 
Level-2 BLAS operations, in |26] the authors proposed an efficient implementation of the classical 
Gram-Schmidt method. 

In |16) is a description of a parallel QR with Gram-Schmidt on GPU and results on an 
implementation with the NVIDIA Geforce 295 are reported. A description on the performance of 
a high performance implementation of the QR algorithm on GPUs appeared in [9]. A report on 
QR decompositions on the NVIDIA Tesla C2050 can be found in [3J. The authors of [9] did not 
consider a fast implementation of the modified Gram-Schmidt method because the vectors in the 
inner products are large and the many synchronizations incur a prohibitive overhead. According 
to [9j, a blocked version is susceptible to precision problems. In our setting, the length n of the 
vectors is small (n = 32 coincides with the warp size) and similar to what is reported in [3], we 
expect the cost of synchronizations to be modest for a small number of threads. Because of our 
small dimensions, we do not consider a blocked version. 

In [1], the problem of solving many small independent QR factorizations on a GPU is inves- 
tigated. Although our QR factorizations are also small, in our application of Newton's method 
in the tracking of one solution path, the linear systems are not independent and must be solved 
in sequence. 

After the QR decomposition, we solve an upper triangular linear system. The solving of dense 
triangular systems on multicore and GPU accelerators is described in [21]. 

Related to polynomial system solving on a GPU, we mention two recent works. In [15], a 
subresultant method with a CUDA implementation of the FFT is described to solve systems of 
two variables. The implementation with CUDA of a multidimensional bisection algorithm on an 
NVIDIA GPGPU is presented in [IJ. 
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2.2 Our contributions 

Based on our computational experiments, our conclusions are twofold: 

1. Despite the low dimensions, we experimentally show that the extra cost of multiprecision 
arithmetic can be compensated by a GPU. 

2. Combined with projected speedups of our massively parallel evaluation and differentiation 
implementation [25], the results pave the way for a path tracker that runs entirely on a 
GPU. 

In this paper we apply the regular notion of speedup and compare the timings on the GPU 
with one core on the CPU for the same computations. When considering different levels of 
precision, a common practice is to compare the accuracy obtained at different precisions. Running 
independent calculations at different levels of precision is a pleasingly parallel computation which 
comes almost for free. 



3 Modified Gram- Schmidt Orthogonalization 

Roots of polynomial systems are typically complex and we calculate with complex numbers. 
Following notations in [6], the inner product of two complex vectors x, y G C" is denoted by 

n 

x^y, where denotes the Hermitian or conjugate transpose. In particular: x^y = ^^x^y^, 

e=i 

where c is the complex conjugate of c G C. 

In order to describe our parallel implementation, in Figure [J we state the pseudo code of the 
modified Gram-Schmidt orthogonalization method. 

Input: A £ C™^". 

Output: Q E C™^", R G C"^": Q^Q = I, 

R is upper triangular, and A = QR. 
let be column k of A 
for k from 1 to n do 

nk ■= \J^k^k 

qfc := ak/rkk, Qfc is column k of Q 
for j from + 1 to n do 

'''kj ■= <ik 

'■~ ~ ''^kj'ik 

Figure 1: The modified Gram-Schmidt orthogonalization algorithm. 



Given the QR decomposition of a matrix A, the system Ax = b is equivalent to QRx = b. 
By the orthogonality of Q, solving Ax. = b is reduced to the upper triangular system Rx = Q^h. 
This solution minimizes lib — Ax||n. 
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Instead of computing Q^h in a separate stage, for numerical stability as recommended in [HI 
jl9.3], one must apply the modified Gram-Schmidt method to the matrix A augmented with b: 



[A h] = [Q q 



n+l 



R 




y 

z 



\Rx 



(1) 

^ and 



As qn+i is orthogonal to the column space of Q, we have || 

y = Q^b. 

As reported in [8], the number of flops in the algorithm in Figure [1] equals 2mn?. The exper- 
imental results in Table [1] show the cubic behavior of the running time: doubling the dimension 
multiplies the user CPU time by a factor of (almost) eight. 



Table 1: User CPU times for 10,000 QR decompositions with double arithmetic on n random 
vectors of dimension n, for increasing values of n. 



n 


CPU time 


factor 


32 


3.7 sec 


1.0 


64 


28.7 sec 


7.8 


128 


225.4 sec 


7.9 


256 


1787.1 sec 


7.9 



The experiments reported in this and the next section were done on one core of an 3.47 Ghz 
Intel Xeon X5690 and with the code in version 2.3.70 of PHCpack [22] . 

4 complex and multiprecision arithmetic: cost and accuracy 

With user CPU times of runs with the modified Gram-Schmidt algorithm on random data in 
Table [2] we illustrate the overhead factor of using complex double, complex double double, and 
complex quad double arithmetic over standard double arithmetic. Considering absolute times, 
the 3.7 seconds increase to 788.3 seconds (more than 48 minutes) when going from double to 
complex quad double arithmetic. 

Table 2: User CPU times for 10,000 QR decompositions on 32 random complex vectors of dimen- 
sion 32, for increasing levels of precision. 



precision 


CPU time 


factor 


double 


3.7 sec 


1.0 


complex double 


26.8 sec 


7.2 


complex double double 


291.5 sec 


78.8 


complex quad double 


2916.8 sec 


788.3 



Using the cost factors we can recalibrate the dimension. Suppose a flop costs 8 times more, 
using 8 = 2^, the number of flops in the modified Gram-Schmidt method is then 8 x 2mn^ = 
2(2m)(2n)^. Working with operations that cost 8 times more increases the cost with the same 
factor as doubling the dimension in the original arithmetic. 
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Taking the cubed roots of the factors in Table [2j 7.2^/3 ^ 1.931, 78.8^/^ ^ 4.287, 788.3^/^ « 
9.238, we can say that the cost of using complex double, complex double double, and complex quad 
double arithmetic is equivalent to working in double arithmetic, but multiplying the dimension 
of the problem respectively by the factors 1.931, 4.287, and 9.238. Multiplying the original 
dimension 32 with 1.931, 4.187, and 9.238 respectively yields 62, 134, 296. So we may say that 
orthogonalizing 32 vectors in C^'^ in quad double arithmetic costs the same amount of work as 
orthogonalizing 296 vectors in C^^^. 

Next we compare the running times of the modified Gram-Schmidt algorithm with the running 
times for the LU factorization. In [8, page 174], LU factorization with the method of Doolittle on 
an m-by-n matrix (for m>n) has cost n'^{m — n/3) which for m = n equals 2n^/3. Not taking 
the cost of arithmetic into consideration, LU factorization is three times faster than the modified 
Gram-Schmidt algorithm. In Table El we take the user CPU times of Table [2] and compare these 
times with the same number of LU factorizations, for the same dimension and also for different 
kinds of arithmetic. 

Table 3: User CPU times for 10,000 QR decompositions with the modified Gram-Schmidt algo- 
rithm (MGS) and 10,000 LU factorizations on 32 random complex vectors of dimension 32, for 
increasing levels of precision. 



precision 


MGS time 


LU time 


factor 


double 


3.7 sec 


2.6 sec 


1.4 


complex double 


26.8 sec 


9.1 sec 


2.9 


complex double double 


291.5 sec 


95.6 sec 


3.1 


complex quad double 


2916.8 sec 


920.1 sec 


3.2 



The last column of Table [3] is the time of the modified Gram-Schmidt algorithm over the 
time of the LU factorization. Except for the calculations in standard double precision, the factor 
averages 3 and increases as the precision increases. An explanation for this increase could be that 
the square root computation costs more for longer numbers. 

Because the modified Gram-Schmidt algorithm takes three times longer than the LU fac- 
torization, the cost of linear system solving relative to the evaluation and differentiation of the 
polynomials grows significantly. This growth is significant enough that we can actually no longer 
regard the cost of linear system solving as being dominated by the evaluation and differentiation 
cost. 

To measure the accuracy of the computed Q G {[;;™x" and R G C"^" of a given A E {j^^-x"^ 
we consider the matrix 1-norm [6j of ^4 — QR: 



max 

i=l,2,...,m 
j=l,2,...,n 



(2) 



For X G [— 10,-|-10], we have x'^ £ [10-^,10+^^], so as the degrees d of the polynomials in 
our system increase we are likely to obtain more extreme values in the Jacobian matrix. In the 
experiments discussed below we generate complex numbers of modulus one as exp{i9), where 
i = \/— T and is chosen at random from [0,27r[. To generate complex numbers of varying 
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magnitude, we consider rexpiO, with r chosen at random from [10^^, 10^^] where g determines 
the range of the moduU of the generated complex numbers. To simulate the numbers in the 
Jacobian matrices arising from evaluating polynomials of degree d, it seems natural to take the 
parameter g equal to d. 

In Table H] experimental values for e are summarized. For complex numbers with moduli 
in [10^^, 10+^] we see that the decimal logarithm of e decreases linearly as g increases. The 
difference in the logarithm of the smallest and the largest error remains almost constant and 
increases slightly as g increases. For g = 16 the results are totally wrong with complex doubles, 
while with complex double double arithmetic we still obtain accurate answers of the usual double 
precision. For complex double double and complex quad double arithmetic, the experiment is 
continued in Table [5] for g ranging from 17 to 32. In this range, the precision of double double 
arithmetic becomes insufficient and quad double arithmetic is needed for accurate results. 

Table 4: For 1,000 QR decompositions on 32-by-32 matrices with randomly generated complex 
numbers of magnitudes uniformly distributed in [10~^, 10^^] and for g ranging from 1 to 16, we 
list rrie = min(log]^Q(e)), Me = max(logiQ(e)), and = rrie — Me, computed in complex double 
and complex double double arithmetic. 





complex double 


compl 


ex double double 


g 


rrie 


Me 


De 


me 


Me 




1 


-14.5 


-14.0 


0.5 


-30.6 


-30.1 


0.5 


2 


-13.7 


-13.0 


0.7 


-29.7 


-29.1 


0.6 


3 


-12.6 


-12.1 


0.5 


-28.7 


-28.1 


0.7 


4 


-11.7 


-11.0 


0.7 


-27.8 


-27.1 


0.7 


5 


-10.7 


-10.1 


0.7 


-26.9 


-26.1 


0.8 


6 


-9.8 


-9.1 


0.7 


-25.9 


-25.0 


0.8 


7 


-8.9 


-8.0 


0.9 


-24.9 


-24.1 


0.8 


8 


-7.8 


-7.0 


0.8 


-24.0 


-23.1 


1.0 


9 


-6.9 


-6.1 


0.8 


-23.0 


-22.2 


0.8 


10 


-5.9 


-5.0 


1.0 


-22.1 


-21.1 


1.0 


11 


-4.8 


-4.1 


0.7 


-21.1 


-20.1 


1.0 


12 


-3.9 


-3.1 


0.8 


-20.1 


-19.2 


0.9 


13 


-3.0 


-2.0 


1.0 


-19.3 


-18.2 


1.1 


14 


-2.0 


-1.0 


1.0 


-18.4 


-17.2 


1.2 


15 


-1.2 


-0.1 


1.1 


-17.3 


-16.2 


1.1 


16 


-0.2 


1.0 


1.2 


-16.4 


-15.1 


1.3 



Our experiments show that the numerical stability of the modified Gram-Schmidt method is 
good and predictable. If we do not want to lose more than half of the number of decimal places 
when working with complex numbers ranging in modulus between 10~^ and 10"*"^, then we must 
compute with a working precision of at least 2g decimal places. 
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Table 5: For 1,000 QR decompositions on 32-by-32 matrices with randomly generated complex 
numbers of magnitudes uniformly distributed in [10~^, 10+^] and for g ranging from 17 to 32, we 
list rrie = min(logio(e)), = max(logio(e)), and = rrie — M^, computed in complex double 
double and complex quad double arithmetic. 





compl 


ex double double 


compl 


ex quad double 


9 


me 


Me 


De 


rUe 


Me 


De 


17 


-15.5 


-14.1 


1.3 


-48.1 


-47.1 


1.0 


18 


-14.6 


-13.2 


1.4 


-47.1 


-46.2 


1.0 


19 


-13.6 


-12.0 


1.5 


-46.4 


-45.1 


1.2 


20 


-12.6 


-11.1 


1.5 


-45.1 


-44.2 


0.9 


21 


-11.4 


-10.2 


1.2 


-44.2 


-43.2 


1.0 


22 


-10.7 


-9.1 


1.5 


-43.3 


-42.2 


1.1 


23 


-9.8 


-8.1 


1.7 


-42.4 


-41.1 


1.3 


24 


-8.8 


-7.2 


1.6 


-41.3 


-40.2 


1.2 


25 


-7.8 


-6.3 


1.5 


-40.2 


-39.2 


1.0 


26 


-6.8 


-5.2 


1.6 


-39.3 


-38.1 


1.1 


27 


-5.7 


-4.2 


1.5 


-38.4 


-37.2 


1.2 


28 


-4.7 


-3.2 


1.5 


-37.7 


-36.1 


1.6 


29 


-3.7 


-2.2 


1.5 


-36.8 


-35.2 


1.6 


30 


-3.0 


-1.2 


1.8 


-35.7 


-34.2 


1.5 


31 


-1.8 


-0.2 


1.6 


-34.8 


-33.3 


1.5 


32 


-1.0 


0.8 


1.9 


-33.9 


-32.2 


1.8 



5 Massively Parallel Modified Gram- Schmidt Orthogonalization 

Our main kernel Normalize_Remove() in Gram-Schmidt orthogonalization normalizes a vector 
and removes components of all vectors with bigger indexes in the direction of this vector. The 
secondary kernel Normalize() only normalizes one vector. The algorithm in Figure [2] overwrites 
the input matrix A so that on return the matrix A equals the matrix Q of the algorithm in 
Figure [H 

The multiple blocks launched by the kernel within each iteration of the loop in the algorithm 
in Figure [2] is the first coarse grained level of parallelism. If we keep in mind that the number 
of variables equals 32, the number of cores on a multiprocessor of the GPU, then the second 
fine grained parallelism resides in the calculation of componentwise operations and of the inner 
products. Threads within blocks perform these operations cooperatively. As one inner product 
of two 32- vectors requires 32 multiplications (one operation per core), note that a multiplication 
in double double and quad double arithmetic requires many operations with hardware doubles. 

As the algorithm suggests the normalization of each is performed (n — k) times — by 
each of the blocks in the kih. launch of the kernel Nornialize_Remove(). However normalizing it 
only once instead would suggest another launch of the kernel Nornialize() associated with extra 
writing to and reading from the global memory of the card of the vector being normalized. This 
would be more expensive than to perform the normalization within Nornialize_Reniove() multiple 
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Input:A G C"'^", ^ = [ai aa ... a„], 

afc G C™, /c= 1,2, 
Output: A G C™^", = / (i.e.: A = Q), 

i?GC"^": i?= [rij], Tij gC, 
z = 1,2, . . . ,n, j = 1,2, . . . ,n. 
for k from 1 to n — 1 do 

launch kernel Normalize_Remove(A;) 
with (n — /c) blocks of threads, 
as the jih. block (for all j : k < j < n) 
normalizes a^ and removes the component 
of aj in the direction of a^ 
launch kernel Normalize(n) with one 
thread block to normalize a„. 

Figure 2: A parallel version of the modified Gram-Schmidt orthogonalization algorithm. 



times. 

The loop in the algorithm in Figure [2] performs n — 1 normalizations, where each normalization 
is followed by the update of all remaining vectors. In particular, after normalizing the A:th vector, 
we launch n — k blocks of m threads. Each thread block handles one aj. The update stage has 
a triangular structure. The triangular structure implies that we have more parallelism for small 
values of k. Therefore, we expect increased speedups at earlier stages of the algorithm in Figure[2j 

The main ingredients in the kernels Normalize() and Nornialize_Reniove() are inner products 
and the normalizations, which we explain in the next two subsections. In subsection C we discuss 
the usage of the card resources by threads of the kernel Nornialize_Remove(). 

5.1 computing inner products 

The warp size of the Tesla C2050 equals 32, and 32 is our typical dimension. In computing x^y 
the products Xi * yi are independent of each other. The inner product x^y is computed in two 
stages: 

1. All threads work independently in parallel: thread i calculates x^-kyi where the operation 
* is a complex double, a complex double double, or a complex quad double multiplication. 
Afterwards, all threads in the block are synchronized for the next stage. 

2. The application of a sum reduction [lOl Figure 6.2] to sum the elements in {xiyi,X2y2, ■ ■ ■ ,Xmym) 
and compute xiyi + X2y2 + ■ ■ ■ +^mym- The + in the sum above corresponds to the ★ in the 
item above and is a complex double, a complex double double, or a complex quad double 
addition. There are log2(m) steps but as the m typically equals the warp size, there is 
thread divergence in every step. 

The number of shared memory locations used by an inner product equals 2m, times the space 
for a complex double, a complex double double, or a complex quad double. 
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The 2m memory locations suffice if we have to compute only one inner product and if we 
allow that one of the original vectors is overwritten. In our algorithm, we need the same vector 
q/c multiple times when computing r^j := q_^SLj (see Figure [1]) so we need an extra m shared 
memory locations to store q^t'^'^ji for ^ = 1, 2, . . . , m. Then the computation of aj := — r^jC^k 
requires another m memory locations to store the products r^j -k q^g for £ = 1, 2, . . . , m. So in 
total we have 4m memory locations in shared memory in the kernel Normalize_Reinove(). 

5.2 the orthonormalization stage 

After the computation of the inner product a^a^, the orthonormalization stage consists in one 
square root computation, followed by m division operations. 

The first thread in a block performs the square root calculation r^.^ := sl^sl^ and then, after 
a synchronization, the m threads in a block independently perform the divisions a^i := dke/fkki 
for £ = 1, 2, . . . , m to compute q^, stored in the location of a^. 

Dividing each component of a vector by the norm happens independently, and as the cost of 
the division increases for complex doubles, complex double doubles, and complex quad doubles, 
so we could expect an increased parallelism as we increase the working precision. Unfortunately, 
also the cost for the square root calculation — executed in isolation by the first thread in each 
block — also increases as the precision increases. 

5.3 the occupancy of multiprocessors 

In the occupancy calculations, we take m = n, and consider the use of complex double double 
arithmetic. Concerning the occupancy of the multiprocessors, one thread block uses 

4 X n X size_of( complex double double) (3) 

bytes of shared memory, which amounts to a bit more than 4000 bytes. Also a thread block uses 
about 3000 registers. The number of blocks scheduled per multiprocessor is 8. It is actually the 
maximum number of blocks which could be scheduled per multiprocessor for the device of compute 
capability 2.0. Neither allocated per block shared memory, neither the number of registers used do 
not appear as the limiting factor on the number of blocks scheduled per multiprocessor. Although 
shared memory and registers of a multiprocessor are employed quite well: 8 blocks of threads use 
about 

100 X 4, 224 X 8/49, 152 ^ 68% (4) 

of shared memory capacity, and 

100 X 3, 072 X 8/32, 768 ^ 75% (5) 

of available registers. For dimension 32, the orthogonalization launches the kernel Normal- 
ize_Remove() 31 times, while first 7 of these launches employ 4 multiprocessors, launches from 
8 to 15 employ 3 multiprocessors, 16-23 employ 2 multiprocessors, and finally launches 24-31 
employ only one multiprocessor. 
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5.4 data movement 

At the beginning of the kernel thread £ of a block reads the ^th component of the vector 
from the global memory into the ith. location of the first column of the shared memory 4 x m 
two dimensional array Sh_Locations of complex numbers of the given precision allocated by the 
block. Subsequently it reads the ith component of the vector a^ into the ith location of the second 
column of Sh_Locations. Both readings are prescribed to be done simultaneously by threads of 
the block. 

For double and double double precision levels we achieve coalesced access to the global memory 
but not for complex quad double numbers. This could explain why the speedups do not increase 
as we go from complex double double to the complex quad double versions of the parallel Gram- 
Schmidt algorithm. 

6 the back substitution kernel 

After the computation of Q and R, denoting Q^h by y, we have to solve the triangular system 
Rx = y. Because of the low dimension of our application, only one block of threads will be 
launched. Pseudo code for a parallel version of the back substitution algorithm is shown in 
Figure [3l 

The most natural order for the parallel version of the back substitution appears to be to 
process the matrix ii in a column fashion. In the kth step we multiply the kth column of R by 

and subtract the product from the right hand side vector y updated by such subtractions at 
all the previous steps. 

Input: R G C"^"", an upper triangular matrix, 

y G C", the right hand side vector. 
Output: X is the solution of -Rx = y. 
for k from n down to 1 do 
thread k does := Vk/fkk 
all threads synchronize 
for j from 1 to A; — 1 do 

thread j does yj := yj — Vj^ -k 
all threads synchronize 

Figure 3: Pseudo code for a parallel back substitution. 

Ignoring the cost of synchronization and thread divergence and with the warp size 32 equal 
to the dimension n, the parallel execution reduces the inner loop to one step. With focus on the 
arithmetical cost, the total number of steps equals 2n. Note that the more costly division operator 
is done by only one thread. More precisely than 2n, the arithmetical cost of the algorithm in 
Figure [3] is n divisions, followed by n multiplications and n subtractions. 

During the execution of the parallel back substitution, the right hand side vector y remains 
in shared memory. At each step k, the current column A; of i? is loaded into shared memory for 
processing. 
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Because only one block can be occupied for this stage, running the same code at different 
levels of precision provides a free error estimate for the solution of the linear system. 

7 Computational Results 

Our computations were done on an HP Z800 workstation running Red Hat Enterprise Linux. For 
speedups, we compare the sequential run times on one core of an 3.47 Ghz Intel Xeon X5690. 
The clock speed of the NVIDIA Tesla C2050 is at 1147 Mhz, about three times slower than the 
clock speed of the CPU. The C++ code is compiled with version 4.4.6 of gcc and we use release 
4.0 of the NVIDIA CUDA compiler driver ^7]. 

We ran the modified Gram-Schmidt method on 32 random complex vectors of dimension 32. 
The times and speedups are shown in Table [6l 

Table 6: Wall clock times and speedups for 10,000 orthogonalizations on 32 random complex 
vectors of dimension 32. 



precision 


1 CPU core 


Tesla C2050 


speedup 


complex double 


13.376 sec 


5.339 sec 


2.5 


complex double double 


115.63 sec 


16.5 sec 


7.0 


complex quad double 


785 sec 


108 sec 


7.3 



In Table [6] compare the 115.63 seconds on 1 CPU core to perform the calculations in double 
double arithmetic with the 108 seconds to do the same computations on a GPU in quad double 
arithmetic. Using a GPU we achieve a quality up: in about the same time we obtain more 
accurate orthogonalizations. 

Table 7: Wall clock times and overall speedups for 10,000 orthogonalizations on 32 random 
complex vectors of dimension 32 and for 10,000 polynomial evaluations and differentiations of 
polynomial system of 32 equations with 32 variables, with 32 monomials per polynomial, with 5 
variables in each monomial, with variable degrees uniformly taken from {1,2,3,4,5}. 



precision 


CPU PE 


GPU PE 


speedup 


complex double 


11 sec 


1.3 sec 


8.5 


complex double double 


66 sec 


2.1 sec 


31.4 


complex quad double 


396 sec 


14.2 sec 


27.9 


precision 


CPU MGS 


GPU MGS 


speedup 


complex double 


13.4 sec 


5.3 sec 


2.5 


complex double double 


115.6 sec 


16.5 sec 


7.0 


complex quad double 


785.0 sec 


108.0 sec 


7.0 


precision 


CPU PE+MGS 


GPU PE+MGS 


speedup 


complex double 


24.4 sec 


6.6 sec 


3.7 


complex double double 


181.6 sec 


18.6 sec 


9.8 


complex quad double 


1181.0 sec 


122.2 sec 


9.7 
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Table 8: Wall clock times and overall speedups for 10,000 orthogonalizations on 32 random 
complex vectors of dimension 32 and for 10,000 polynomial evaluations and differentiations of 
polynomial system of 32 equations with 32 variables, with 32 monomials per polynomial, with 12 
variables in each monomial, with variable degrees uniformly taken from {1, 2, . . . , 10}. 



precision 


CPU PE 


GPU PE 


speedup 


complex double 


22.5 sec 


1.8 sec 


12.5 


complex double double 


135.0 sec 


4.1 sec 


32.9 


precision 


CPU MGS 


GPU MGS 


speedup 


complex double 


13.4 sec 


5.3 sec 


2.5 


complex double double 


115.6 sec 


16.5 sec 


7.0 


precision 


CPU PE+MGS 


GPU PE+MGS 


speedup 


complex double 


35.9 sec 


7.1 sec 


5.1 


complex double double 


250.6 sec 


20.6 sec 


12.2 



In Tables [7] and [8] projected overall speedups are computed for Tesla 2050 path tracking. We 
see that speedups are higher as there are more variables in the monomials and as the degrees of 
the variables are higher. Our polynomial evaluation and differentiation kernels have to be tuned 
so that their thread blocks would use more multiprocessor registers than shared memory for to 
be able to work with quad double arithmetic for systems of characteristics as in Table [HI 

8 Conclusion 

Using a massively parallel algorithm for the modified Gram-Schmidt orthogonalization on a 
NVIDIA Tesla C2050 Computing Processor we can compensate for the cost of one extra level of 
precision, even already for modest dimensions. 
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